## 10. Summary
![REINFORCE](img/screen-shot-2018-07-17-at-4.44.10-pm.png)

REINFORCE increases the probability of "good" actions and decreases the probability of "bad" actions. (Source)
### What are Policy Gradient Methods?
- Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
- Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
- In this lesson, we represent the policy with a neural network, where our goal is to find the weights \theta of the network that maximize expected return.
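As a concrete reference, here is a minimal sketch of such a policy network, assuming a PyTorch implementation and a small discrete-action environment; the layer sizes, `state_size=4`, and `action_size=2` are illustrative placeholders, not the lesson's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Maps a state to a probability distribution over (discrete) actions."""
    def __init__(self, state_size=4, hidden_size=16, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=-1)   # action probabilities pi_theta(.|s)

    def act(self, state):
        """Sample an action and return it with its log-probability."""
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```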
### The Big Picture
- The policy gradient method will iteratively amend the policy network weights to:
  - make (state, action) pairs that resulted in positive return more likely, and
  - make (state, action) pairs that resulted in negative return less likely.
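The sign of the return is what drives the update. A toy, self-contained sketch of this idea (assuming PyTorch; the numbers are arbitrary, and this is not the full algorithm):

```python
import torch

# A single log-probability weighted by its trajectory's return.
log_prob = torch.tensor(-0.7, requires_grad=True)  # stands in for log pi(a|s)
R = 1.0                                            # return of the trajectory (positive in this toy case)

surrogate = -(log_prob * R)   # minimizing this maximizes log pi(a|s) when R > 0
surrogate.backward()
print(log_prob.grad)          # tensor(-1.), i.e. -R: a descent step pushes log_prob up,
                              # making the (state, action) pair more likely; R < 0 flips the sign
```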
### Problem Setup
- A trajectory \tau is a state-action sequence s_0, a_0, \ldots, s_H, a_H, s_{H+1}.
- In this lesson, we will use the notation R(\tau) to refer to the return corresponding to trajectory \tau.
- Our goal is to find the weights \theta of the policy network that maximize the expected return U(\theta) := \sum_\tau \mathbb{P}(\tau;\theta)R(\tau).
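A small helper sketch for R(\tau), assuming Python; `gamma=1.0` recovers the plain cumulative reward used in this lesson's notation, while a discount factor is a common variant:

```python
def trajectory_return(rewards, gamma=1.0):
    """R(tau): the (optionally discounted) cumulative reward of one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. the rewards r_0, ..., r_H collected along one trajectory
print(trajectory_return([1.0, 1.0, 1.0, 1.0]))   # 4.0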
### REINFORCE
- The pseudocode for REINFORCE is as follows:
  1. Use the policy \pi_\theta to collect m trajectories \{ \tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)} \} with horizon H. We refer to the i-th trajectory as \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta): \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy: \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
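A minimal training-loop sketch of the pseudocode above. It assumes PyTorch, the classic Gym API (env.reset() returning a state array, env.step() returning four values; newer gymnasium versions differ), and the Policy class from the earlier sketch; m, H, the learning rate, and the iteration count are illustrative choices rather than the lesson's settings:

```python
import gym
import torch
import torch.optim as optim

env = gym.make('CartPole-v1')                      # any discrete-action env with state_size=4 works here
policy = Policy(state_size=4, action_size=2)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

m = 10      # number of trajectories per gradient estimate
H = 1000    # horizon (maximum steps per trajectory)

for iteration in range(100):
    sum_log_probs = []   # sum_t log pi_theta(a_t|s_t) for each trajectory
    returns = []         # R(tau^(i)) for each trajectory

    # Step 1: use the policy to collect m trajectories with horizon H.
    for _ in range(m):
        state = env.reset()
        log_probs, rewards = [], []
        for t in range(H):
            action, log_prob = policy.act(state)
            state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                break
        sum_log_probs.append(torch.cat(log_probs).sum())
        returns.append(sum(rewards))               # undiscounted R(tau^(i))

    # Step 2: estimate the gradient. Minimizing this surrogate loss with the
    # optimizer is equivalent to a gradient-ascent step along g-hat.
    loss = -torch.stack([lp * R for lp, R in zip(sum_log_probs, returns)]).mean()

    # Step 3: update the weights of the policy (theta <- theta + alpha * g-hat).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```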
### Derivation
- We derived the likelihood ratio policy gradient: \nabla_\theta U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\nabla_\theta \log \mathbb{P}(\tau;\theta)R(\tau).
- We can approximate the gradient above with a sample-weighted average: \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)}).
- We calculated the following: \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)}).
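For reference, the intermediate algebra between these bullets is the standard likelihood-ratio (log-derivative) trick; the worked version below omits the initial-state distribution term, which also carries no \theta-dependence:

\begin{aligned}
\nabla_\theta U(\theta) &= \nabla_\theta \sum_\tau \mathbb{P}(\tau;\theta)R(\tau) = \sum_\tau \nabla_\theta \mathbb{P}(\tau;\theta)\,R(\tau) \\
&= \sum_\tau \mathbb{P}(\tau;\theta)\,\frac{\nabla_\theta \mathbb{P}(\tau;\theta)}{\mathbb{P}(\tau;\theta)}\,R(\tau) = \sum_\tau \mathbb{P}(\tau;\theta)\,\nabla_\theta \log \mathbb{P}(\tau;\theta)\,R(\tau)
\end{aligned}

For the last bullet, the trajectory probability factors into transition probabilities and policy terms, and only the policy terms depend on \theta:

\nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H}\Big[\log \mathbb{P}(s_{t+1}^{(i)}|s_t^{(i)},a_t^{(i)}) + \log \pi_\theta(a_t^{(i)}|s_t^{(i)})\Big] = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}).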
### What's Next?
- REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
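For the continuous-action case, only the output distribution changes; the log-probability term in \hat{g} is computed the same way. A sketch assuming PyTorch and a Gaussian parameterization (the sizes and the learned log-std are illustrative, not the lesson's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Maps a state to a Gaussian distribution over continuous actions."""
    def __init__(self, state_size=3, hidden_size=16, action_size=1):
        super().__init__()
        self.fc = nn.Linear(state_size, hidden_size)
        self.mu = nn.Linear(hidden_size, action_size)          # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_size))  # learned log standard deviation

    def act(self, state):
        x = F.relu(self.fc(torch.as_tensor(state, dtype=torch.float32)))
        dist = Normal(self.mu(x), self.log_std.exp())
        action = dist.sample()
        # Summing log-probabilities over action dimensions gives log pi_theta(a|s),
        # so the REINFORCE update above is unchanged.
        return action.numpy(), dist.log_prob(action).sum()
```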